library(tidyverse)
library(ggplot2)
library(ggpubr)
library(kableExtra)
library(weathermetrics)
Let’s load the mpg dataset from ggplot2.
mpg contains a subset of the fuel economy data. The data
frame has 234 rows and 11 variables:
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
The glimpse() function can be used to get an overview of
the dataframe structure. It is similar to str(), but more
readable.
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
# for comparison
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Are there any variables that do not have the class you expect it to?
The categorical variables are listed as character variables
Make any changes you feel are necessary.
# method 1: using lapply()
# find and return chr variables by:
# 1. use sapply() and is.character to find chr columns and return them as logical (TRUE is chr)
# 2. filter the original dataframe to select only the chr variables
# 3. obtain the chr variable names with colnames()
chr_variables <- colnames(mpg[,sapply(mpg,is.character)])
# convert chr variables to factors
mpg[chr_variables] <- lapply(mpg[chr_variables], as.factor)
# method 2: using mutate_if(), quicker and much more readable
mpg <- mpg %>%
mutate_if(is.character, as.factor)
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <fct> audi, audi, audi, audi, audi, audi, audi, audi, audi, aud…
## $ model <fct> a4, a4, a4, a4, a4, a4, a4, a4 quattro, a4 quattro, a4 qu…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
## $ drv <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
## $ class <fct> compact, compact, compact, compact, compact, compact, com…
Below hwy is summarised by vehicle class, by asking for
commonly used summary statistics.
mpg %>%
group_by(class) %>% # by class
summarise(min = min(hwy), # minimum of hwy
mean = mean(hwy), # mean
sd = sd(hwy), # standard deviation
max = max(hwy), # maximum
n = n()) %>% # nr of observations
kbl() %>%
kable_styling(latex_options = c("striped", "hover"), full_width = F)
| class | min | mean | sd | max | n |
|---|---|---|---|---|---|
| 2seater | 23 | 24.80000 | 1.303840 | 26 | 5 |
| compact | 23 | 28.29787 | 3.781620 | 44 | 47 |
| midsize | 23 | 27.29268 | 2.135930 | 32 | 41 |
| minivan | 17 | 22.36364 | 2.062655 | 24 | 11 |
| pickup | 12 | 16.87879 | 2.274280 | 22 | 33 |
| subcompact | 20 | 28.14286 | 5.375012 | 44 | 35 |
| suv | 12 | 18.12903 | 2.977973 | 27 | 62 |
Create a summary of engine displacement by year, including the minimum, maximum, median, and inter quantile range.
mpg %>%
group_by(year) %>% # by year
summarise(min = min(displ), # minimum of engine disp
max = max(displ), # maximum
Q1 = quantile(displ, 0.25), # 1st quartile
median = median(displ), # median, 2nd quartile
Q3 = quantile(displ, 0.75) # 3rd quartile
) %>%
kbl() %>%
kable_styling(latex_options = c("striped", "hover"), full_width = F)
| year | min | max | Q1 | median | Q3 |
|---|---|---|---|---|---|
| 1999 | 1.6 | 6.5 | 2.2 | 3.0 | 4.0 |
| 2008 | 1.8 | 7.0 | 2.5 | 3.6 | 4.7 |
# testing base R functions plot(), hist(), barplot()
hist(mpg$displ)
barplot(table(mpg$cyl))
plot(x = mpg$displ, y = mpg$hwy,
xlab = "Highway mpg",
ylab = "Engine displacement (L)")
General ggplot template:
```{r}
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>)) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION> +
<THEME>
```
ggplot(iris) + # Data
geom_point(mapping = aes(x = Sepal.Length, # Variable on the x-axis
y = Sepal.Width, # Variable on the y-axis
colour = Species)) + # Legend
labs(x = "Sepal Length",
y = "Sepal Width",
title = "Relationship between Sepal Length and Width by Species") +
coord_cartesian() + # Default standard for mapping x and y
facet_wrap(~Species) + # Splits plot by the species variable
theme_bw() # Sets the background theme
ggplot(mpg) + # data element (mpg)
geom_bar(aes(x = drv)) # geom_function (bar plot)
Create a bar chart that shows a count for the different vehicle classes in mpg.
ggplot(mpg) + # data element (mpg)
geom_bar(aes(x = class)) # geom_function (bar plot) with variable 'class' to be shown on axis x.
Look up different ggplot themes, and apply one to the bar chart you just created.
ggplot(mpg) + # data element (mpg)
geom_bar(aes(x = class)) + # geom_function (bar plot) with variable 'class' to be shown on axis x
theme_minimal()
ggplot(mpg) + # data element (mpg)
geom_bar(aes(x = class)) + # geom_function (bar plot) with variable 'class' to be shown on axis x
theme_pubclean()
ggplot(mpg) +
geom_histogram(aes(x = cty),
binwidth = 3)
What happens when you change the value of binwidth to 1? Does the distribution change?
ggplot(mpg) +
geom_histogram(aes(x = cty),
binwidth = 1)
A binwidth of 1 is thinner than the one of 3, meaning it
corresponds to a shorter interval of values. The distribution does not
change significantly, but it becomes less smooth. It also reveals two
interesting patterns: first, a peak at around cty = 21,
almost as high as the interquartile ranges of Q1 = 14 and Q3 = 19, which
was not visible in the previous histogram. Secondly, the outliers on the
tail become more visible, there are two values of cty
greater than 30 with very few cases that could be investigated.
ggplot(mpg, aes(x = cty)) +
geom_density(fill = "darkseagreen")
Add rug marks to plot above by adding the argument + geom_rug(size = 1, colour = “darkorange”)
ggplot(mpg, aes(x = cty)) +
geom_density(fill = "darkseagreen") +
geom_rug(linewidth = 1,
colour = "darkorange")
# geom_point plots values of displ on the x axis and values of hwy on the y axis
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy))
Adding a third variable as colour.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = class))
Repeat the plot above, but mapping the type of transmission to the colour aesthetic. Add a theme and change the titles of the x- and y-axes.
# title of axis can be accessed by `labs` (labels)
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy,
colour = trans)) +
labs(x = "Engine Displacement in litres",
y = "Highway miles per gallon") +
theme_minimal()
Create a scatter plot showing the relationship between engine displacement and city miles, including a line of best fit.
# title of axis can be accessed by `labs` (labels)
# aes() must be inside ggplot now, x and y are required by geom_smooth()
ggplot(mpg, aes(x = displ,
y = cty)) +
geom_point() +
geom_smooth() +
labs(x = "Engine Displacement in litres",
y = "City miles per gallon") +
theme_minimal()
Recreate the previous plot, but adding colour = class to the aesthetic mappings. Use method = “lm” and set se = FALSE in geom_smooth(). What difference does this make?
# title of axis can be accessed by `labs` (labels)
# aes() must be inside ggplot now, x and y are required by geom_smooth()
ggplot(mpg, aes(x = displ,
y = cty,
colour = class)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
labs(x = "Engine Displacement in litres",
y = "City miles per gallon") +
theme_minimal()
There are three modifications. First, adding colour to the aesthetic
mappings creates clusters using the class variable. As a
result, geom_smooth() fits separate lines for each of these
colour clusters. Second, the method = "lm" argument stands
for linear model and finds a linear relationship in the data, fitting a
linear line, instead of a line full of curves. Third, the argument
se = FALSE displays no confidence interval around the
smooth.
For comparison, let’s try plotting without se set to
FALSE.
ggplot(mpg, aes(x = displ,
y = cty,
colour = class)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Engine Displacement in litres",
y = "City miles per gallon") +
theme_minimal()
The plot is less readable with the display of overlapping confidence intervals.
Boxplots are often used to explore the distribution of a continuous variable broken down by a categorical variable.
ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = hwy))
These features of the boxplot are:
What do each of these features tell us?
Add + coord_flip() to the plot above. What does this do?
ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = hwy)) +
coord_flip()
coord_flip() flips the x and y axes, making the reading
of bloxplot charts horizontal instead of vertical.
What can you conclude about the highway miles per gallon of each type of drive train from the boxplot above?
Drive train “r” (rear-wheel) has a median of highway miles per gallon of a bit greater than 20. There are no outliers for this type of drive.
“f” stands for drive train type front-wheel. The group has the highest median of all three types of drive trains, a bit less than 30 highway miles per gallon. It also has many outliers, the dots beyond the whiskers, both beyond the maximum and minimum whiskers.
Drive train type “4” has no outliers. The group has the smallest median, making less than 20 highway miles per gallon.
The variable trans contains 10 different transmissions
types. These 10 categories can be assumed under 2 broader categories:
manual and auto. Use mutate() and
fct_collapse() to collapse the 10 categories of trans to
just these 2 categories.
# find the unique transmission types
unique(mpg$trans)
## [1] auto(l5) manual(m5) manual(m6) auto(av) auto(s6) auto(l4)
## [7] auto(l3) auto(l6) auto(s5) auto(s4)
## 10 Levels: auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) ... manual(m6)
mpg <- mpg %>%
mutate(trans = fct_collapse(trans,
"manual" = c("manual(m5)",
"manual(m6)"),
"auto" = c("auto(av)",
"auto(l3)",
"auto(l4)",
"auto(l5)",
"auto(l6)",
"auto(s4)",
"auto(s5)",
"auto(s6)")))
# check if mutation works
unique(mpg$trans)
## [1] auto manual
## Levels: auto manual
Use the previous example to create a boxplot mapping cty on the y-axis and drv on the x-axis, this time adding the argument colour = trans to aes(). What has changed?
ggplot(mpg) +
geom_boxplot(aes(x = drv,
y = cty,
colour = trans))
Adding a colour variables to the aesthetic of the boxplot splits each
drive train into two groups: the manual group and the auto group. Since
we originally had 3 drive train groups, the plot now has 6 boxplots.
Drive train ‘f’ remains with outliers, both in auto and
manual trans. Drive train ‘r’ remains without outliers.
Drive train ‘4’ originally had no outliers, but splitting in trans
groups revealed outliers in the auto group. The median
varies between trans groups for all drive train types -
manual has a higher median of highway miles per gallon in
all cases.
Facets are a way to split your plot into many subplots according to some categorical variable.
To do this, you use the command facet_wrap() to facet by
a single variable. The first argument to facet_wrap() is a
formula initiated by ~ and a variable name.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~ class, nrow = 2)
Let’s plot the previous example without the facet for a fresher visual comparison.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy))
It is possible to facet on a combination of two variables using facet_grid(). Like before, the first argument is a formula, but containing two variable names separated by ~.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_grid(drv ~ cyl)
Notice that there are empty cells in the bottom left facets above. What do you think this means?
It means there are no data points in the specific combination of the
two categorical variables used in the facet grid, that is to say, there
are no drive trains of r type (r = rear wheel drive) with 4
cylinders, nor with 5 cylinders. There are also empty cells in the
facets that represents the combination of drv = 4 (4wd) and
5 cylinders.
Create a scatter plot of displ and hwy, faceting by the of manufacturer name.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~ manufacturer, nrow = 2)
Using nrow = 2 consolidates all facets in two rows. This
is not a great choice for two reasons. First, there are a total of 15
different manufacturers, an odd number, making the layot of the facets
asymmetric. Second, the plots and crushed to fit the display in only two
lines, making them less readable. A nice choice is to use
nrow = 3, a number by which 15 is divisible, generating a
better layout and giving more horizontal space for each plot.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
facet_wrap(~ manufacturer, nrow = 3)
Change the names on the axes to be more informative.
ggplot(mpg) +
geom_point(aes(x = displ,
y = hwy)) +
labs(x = "Engine displacement, in litres",
y = "Highway miles per gallon") +
facet_wrap(~ manufacturer, nrow = 3)
If you have multiple plots that you want to arrange on the same page, you can use ggarrange() from the ggpubr package.The first step is creating some plots.
Create the following three plots: A. A bar plot showing counts of class, assigning class to the fill aesthetic.
B. A box plot of class by hwy per gallon, assigning class to the colour aesthetic.
C. A jittered scatterplot of hwy and cty miles per gallon, assigning class to the colour aesthetic.
Remember to save each plot as an object to be called upon later.
# plot A - the `fill` argument attributes a color to each class
plot_a <- ggplot(mpg) +
geom_bar(aes(x = class, fill = class)) +
labs(x = "Type of car",
y = "Count") +
theme_light()
# plot B
plot_b <- ggplot(mpg) +
geom_boxplot(aes(x = class,
y = hwy,
colour = class)) +
labs(x = "Type of car",
y = "Highway miles per gallon") +
theme_light()
# plot C
# The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
plot_c <- ggplot(mpg) +
geom_jitter(aes(x = hwy,
y = cty,
colour = class)) +
labs(x = "Highway miles per gallon",
y = "City miles per gallon") +
theme_light()
D. Use ggarrange() to arrange the three plots in one space. ggarrange() takes the plots as arguments as well as ncol or nrow to customise the arrangement
ggarrange(plot_a,
plot_b,
plot_c)
The arranging of the plot can be improved by flipping the coordinates so that the names of the classes in the first two plots are readable. Let’s rewrite the code for A and B adding this property.
# plot A - the `fill` argument attributes a color to each class
plot_a_flip <- ggplot(mpg) +
geom_bar(aes(x = class, fill = class)) +
labs(x = "Type of car",
y = "Count") +
coord_flip() +
theme_light()
# plot B
plot_b_flip <- ggplot(mpg) +
geom_boxplot(aes(x = class,
y = hwy,
colour = class)) +
labs(x = "Type of car",
y = "Highway miles per gallon") +
coord_flip() +
theme_light()
ggarrange(plot_a_flip,
plot_b_flip,
plot_c,
ncol = 2,
nrow = 2)
E. Repeat the code from the previous question, adding common.legend = TRUE and legend = “bottom” to ggarrange().
common.legend = TRUE avoids repeating class legend and
makes the figure more readable.
ggarrange(plot_a_flip,
plot_b_flip,
plot_c,
ncol = 2,
nrow = 2,
common.legend = TRUE,
legend = "bottom")